NetMIM: network-based multi-omics integration with block missingness for biomarker selection and disease outcome prediction

Abstract Compared with analyzing omics data from a single platform, an integrative analysis of multi-omics data provides a more comprehensive understanding of the regulatory relationships among biological features associated with complex diseases. However, most existing frameworks for integrative analysis overlook two crucial aspects of multi-omics data. Firstly, they neglect the known dependencies among biological features that exist in highly credible biological databases. Secondly, most existing integrative frameworks just simply remove the subjects without full omics data to handle block missingness, resulting in decreasing statistical power. To overcome these issues, we propose a network-based integrative Bayesian framework for biomarker selection and disease outcome prediction based on multi-omics data. Our framework utilizes Dirac spike-and-slab variable selection prior to identifying a small subset of biomarkers. The incorporation of gene pathway information improves the interpretability of feature selection. Furthermore, with the strategy in the FBM (stand for ”full Bayesian model with missingness”) model where missing omics data are augmented via a mechanistic model, our framework handles block missingness in multi-omics data via a data augmentation approach. The real application illustrates that our approach, which incorporates existing gene pathway information and includes subjects without DNA methylation data, results in more interpretable feature selection results and more accurate predictions.


Introduction
Recent rapid developments of high-throughput technologies have made it feasible to explore an individual patient's genome through long lists of genetic, epigenetic, and transcript features.Consequently, integrative analysis based on such multi-level omics data, also known as vertical integration, has gained significant attention as a means of understanding the fundamental mechanism of disease and pathologies [1,2].Compared with research focused on a single type of genomic alteration, an integrative framework simultaneously considers the regulatory mechanism between different omics data, thereby enhancing our comprehension of the causality of disease and the interaction between environments and patients [3,4].
Multi-omics data integration methods have been used widely for several crucial tasks in bioinformatics.For the clustering task, several methods have been developed, including iCluster [5] and its variants [6,7], Bayesian consensus clustering [8], and BayesianTWL [9], to cluster cancer patients based on multi-omics data.For association studies, frequentist approaches like collaborative regression [10], canonical variate regression [11], LRM-SVD [12], DFNForest [13], and deep learning methods [14] are utilized for outcome prediction, classification, and feature selection through penalty-based techniques.In the Bayesian framework, iBAG [15], FBM [16], and a Bayesian negative binomial mixture regression model [17] are employed to investigate the regulatory pattern between gene expression and DNA methylation, as well as the association between gene expression and clinical outcomes.
One major challenge in multi-omics integration is the high dimensionality of measured features.Most of the abovementioned methods tackle the high-dimensionality problem through dimension reduction techniques [18] or feature selection techniques.Feature selection is enabled by popular methods like penalization methods such as LASSO [19], elastic net [20], and fused LASSO [21], as well as Bayesian variable selection priors like Bayesian LASSO [22] and spike-and-slab priors [23,24].To evaluate the performance of different methods, some benchmark studies have also been developed [25][26][27].
Although feature selection methods have been successful, few of the aforementioned techniques consider the known structural dependencies among genetic features.Consequently, the results obtained are hard to interpret in downstream enrichment analysis, as genes in the same pathway often have similar cellular functions.To address these limitations and obtain more consistent results, several methods have been developed that incorporate known network or pathway information.For example, Gao et al. (2019) [ 28] identified significant genetic features via a networkconstrained regularization, while Stingo and Vannucci (2011) [29] incorporated pathway information through the Markov random field (MRF) prior to the variable selection of discriminant analysis.Li et al. (2022) [30] also incorporated prior network information through penalization in the clustering of messenger RNA (mRNA) expression data.However, these frameworks mainly focus on single omics data, and a unified framework that integrates known network and pathway information for multi-omics data is still limited.
Another challenge in multi-omics integrative analysis is the high proportion of missing data, particularly in the block missing structure (Fig. 1), where only a portion of the subjects are measured in all types of platforms [16].For example, in The Cancer Genome Atlas (TCGA) kidney cancer study's kidney renal clear cell carcinoma (KIRC) project, 532 cancer subjects were measured in gene expression, but only 319 cancer subjects were measured in DNA methylation with the 450k array.Most integrative approaches only consider the complete data case by removing subjects without all omics data types.However, this naive approach decreases the sample size, resulting in reduced statistical power, particularly when the number of types of omics data is high.To address the block missing problem, integrative imputation techniques are conducted before model fitting, which utilize the correlations and shared information among multi-omics data sets, such as canonical correlation analysis [31] and multiple imputations of multiple factor analysis [32].In contrast, a Bayesian framework can offer a unified framework for handling missing data.In the Bayesian analysis dealing with missing data, one approach involves modeling the marginal distribution of observed data after integrating out the missing observations when the missing mechanism is missing at random [33].The other approach, data augmentation, samples missing data from their conditional posterior distribution [34,35].For example, FBM [16] employs data augmentation to accomplish the block missing problem.However, it shows low efficiency in feature selection and lacks interpretability due to the absence of dependency structure information among the features.To the best of our knowledge, there is no unified framework that handles missing data while also serving the networkbased feature selection and model prediction in multi-omics data integration.
Motivated by the challenges in multi-omics data analysis and the iBAG model, which constructs the relationship between gene expression and DNA methylation via a mechanistic model, we proposed the Network-based Multi-omics Integration with Missingness (NetMIM) method to address the interpretability problem and situations where only partial subjects have all types of omics data.Compared with another iBAG-driven model (FBM), which only focuses on continuous response and provides unsatisfactory feature selection results, our proposed framework makes several contributions.Firstly, it utilizes Dirac spike-and-slab variable selection prior to identifying a small subset of biomarkers associated with clinical outcomes.Secondly, it incorporates the known structural dependencies among biomarkers into feature selection via an MRF prior, improving the interpretability of feature selection.In the simulation analysis, we show that the informative prior can improve the performance of feature selection and model prediction.Lastly, the novel unified framework is extended to handle binary response and survival response while effectively addressing missing data through a data augmentation approach.
The article is structured as follows.In Section 3, we introduce our Bayesian integrative analysis model and the inference of parameters.Section 3 comprises extensive simulations and comparisons with other methods in various scenarios.In Section 3, we evaluate the proposed method on multi-omics data with continuous response.In Section 3, we present real applications to KIRC and lung adenocarcinoma (LUAD) datasets from TCGA, where a proportion of subjects have no DNA methylation.Finally, in Section 3, we provide our conclusions and discussions.

Methods
Let Y N denote the clinic outcome of interest for a total of N patients, the type of which could be a continuous response, dichotomous response, and survival response.For the n-th patient, (m n1 , m n2 , . . ., m nJ ) represents the measurement of J DNA methylation probes and (e n1 , e n2 , . . ., e nK ) represents the mRNA expression levels for K genes.Apart from genomic features, L clinic covariates, including age, gender, and tumor stage, are denoted as (c n1 , c n2 , . . ., c nL ).Hence, all the data can be denoted as {Y N , M N×J , E N×K , C N×L } in matrix notation.

Model
The structure of NetMIM is motivated by the iBAG model [15] which integrates multi-omics data according to the biological mechanism.It is known that molecular features measured at the transcript level (e.g. mRNA expression) affect clinical outcomes more directly than molecular features measured at the DNA/epigenetics level (e.g.DNA methylation).Molecular features measured at the DNA level affect clinical outcomes by inf luencing mRNA expression.Under the guidance of these biological mechanisms, the iBAG model partitions gene expression into different (independent) units and uses this to identify genes relevant to clinical outcomes as modulated by other DNA level features, including DNA methylation.
NetMIM contains a two-layer hierarchical structure (Fig. 2).The first layer is a mechanistic model to infer the effect of DNA methylation on gene expressions.In the mechanistic model, the gene expressions are decomposed into the part modulated by methylation (E M ) and the part controlled by other regulators (E M).
where = (ω jk ) J×K : ω jk is the "gene-methylation" effect, indicating the effect of j-th methylation probe on the k-th gene.The second layer is a clinical model to assess the relationship between clinical outcome and genomic information obtained in the mechanistic model.For continuous response, the model is where The error term is assumed to follow the Gaussian distribution, given by N(0, In the original iBAG model, all methylation sites/probes within the promoter region of a given gene are summarized to generate the gene's single methylation level.However, since not all methylation probes regulate gene expression, it is more appropriate to include methylation probes in the promoter region with a shrinkage prior on the regression coefficients.Therefore, we specify a spike-and-slab prior on the "gene-methylation" effect : where z jk is the binary indicator for j-th methylation probe, σ 2 k is the variance of mechanistic model for k-th gene, and δ 0 (•) is the Dirac delta function.A simple independent Bernoulli prior, z jk ∼ Bern(π k ), is assigned, where π k could be either treated as a hyperparameter or a random variable.We assign a Beta distribution Beta(a, b) on π k , leading to a Beta-Binomial prior on the number of effective methylation probes.Due to the high dimensionality of genes, it is necessary to consider variable selection approaches in the clinical model.We specify a similar Dirac spike-and-slab prior [17] on type M effect and type M effect to induce sparsity: where γ M k and γ M k are binary indicators, indicating whether or not there exists type M effect or type M effect for k-th gene.Compared with another iBAG-driven method FBM [16], which utilized a continuous spike-and-slab distribution for same level variable selection in Equation ( 3) and ( 4), our model can achieve zero coefficient exactly with Dirac distribution.More interestingly, the continuous spike-and-slab prior tends to falsely include E M variables, preventing to identify the effect of E M (see Fig. 4).
Although it is feasible to assign an independent Bernoulli prior on the indicators, recent contributions in Bayesian variable selection of modeling genomic features have incorporated the external information about dependencies among variables via MRF priors [17,29].Hence, we specify an MRF prior on γ , considering dependencies among genes that are represented by a gene-gene interaction network extracted from the KEGG database.It is given by where where G is p × p binary symmetric matrix representing the links in the gene network.If gene j and l are linked in the network, we have g jl = 1; otherwise g jl = 0. Since the expression of gene k , which means that any one of components being effective implies the gene k is effective in the model.We complete the model by assigning an inverse-Gamma prior on the variance parameters in the model given by σ

Model with missingness
In Bayesian frameworks, when data are missing at random, one approach is to model the marginal distribution of observed data by integrating out missing data.The other approach is the data augmentation method, which imputes the missing values from the conditional distribution of observed data and current parameters in each iteration.Following the strategy of handling missing data in the FBM model [16], the NetMIM model utilizes the data augmentation method to handle missing omics data types. Let For the expression of k-th gene in the missing data, we specify the following model: where J k is the set of methylation probes mapped to the k-th gene.
Since we only consider the methylation probes in the promoter region of each gene, the methylation probes are many-to-one mapped to genes, meaning that many methylation probes are mapped to one gene, but one methylation probe is only mapped to one gene.For the DNA methylation of the j-th probe in the missing data, we impose the imputation model by where (σ m j ) 2 is the variance for the j-th methylation probe.(σ m j ) 2 = 1 if the M value of DNA methylation data is normalized with mean 0 and variance 1.

Model for discrete and survival outcomes
Following the latent variable formulation in the iBAG [15], the NetMIM model can be easily extended to model discrete and censored outcomes.Specifically, when the clinical response is a binary variable taking values of 0 or 1, we can augment a latent continuous variable Y * via the probit model.The relationship between the latent continuous variable Y * n and Y n can be expressed as for n = 1, . . ., N. The response in Equation ( 2) is replaced by Y * and parameter representations and corresponding interpretations remain the same as those for continuous outcomes.
If the clinical outcome of interest is patient survival time (with censoring), we use the accelerated failure time model [36].For right-censored response variable Y n = (y n , δ n ), where y n = min(t n , c n ) is the minimum value between survival time t n and censoring time c n of patient n, and δ n = I{t n ≤ c n } is event indicator.The relationship between augmented continuous variable Y * and response variable Y can be expressed as The full conditionals and the Markov chain Monte Carlo (MCMC) sampling schemes for discrete and survival responses are provided in the Supplementary Material.

Model fitting and posterior inference
We detailed an MCMC algorithm based on the stochastic search for variable selection and Gibbs sampler for parameter estimation in the Supplementary Material.For posterior inference, our primary interest lies in the selection of genes associated with clinical response, as captured by the indicators γ M and γ M. To summarize the posterior distribution of model selection indicators, one approach is to use maximum-a-posterior estimates.However, due to the large model space, an alternative approach is preferred, which involves thresholding the estimated marginal posterior probability of inclusion (PPI) of a single feature [17].
The PPI is obtained as the proportion of included MCMC iterations after burn-in where the corresponding feature is selected within the model.For the choice of threshold, the procedure to control the Bayesian false discovery rate proposed by Newton et al. (2004) [37] could be applied.It is suggested that a threshold equal to 0.5 often results in a reasonable Bayesian FDR, so we follow their rule by selecting features with PPI larger than 0.5 [38].
Our second interest is the prediction of clinical response using the Bayesian model average approach.Given a new sample i , where βc i , βM i , β M i , and ˆ i are the simulated parameters from the i-th MCMC iteration.We average the estimation from , where T b is the burn in iteration.For continuous response, the predicted mean square error (PMSE) will be calculated by where N v is the sample size of the validation data set.We use the area under the curve (AUC) of the receiver operating characteristic, which is a plot of the true positive rate (TPR) versus the false positive rate (FPR), to evaluate the effectiveness of feature selection.We utilize the PPI values of the features in simulations where the underlying truth is known.By summarizing the relationship between TPR and FPR under different PPI thresholds, the AUC provides a more accurate measurement for feature selection, ranging from 0 to 1.A higher AUC indicates a more accurate feature selection.To assess the performance of outcome prediction, we calculate the PMSE on the independent test datasets in simulations and the real application with continuous response.A smaller PMSE indicates a better outcome prediction.In the real application where the clinical response is a right-censored variable associated with the patient's survival time (KIRC and LUAD), we utilize the concordance index (C-index) [39] as the metric to evaluate outcome prediction.The C-index is defined as the proportion of patient pairs in which the predictions and outcomes are concordant.

Simulation schemes
To evaluate the performance of our NetMIM model, we conduct several simulations based on full omics data and missing omics data according to the schemes in FBM [16].Specifically, the clinical covariate matrix is simulated from N(0, 1) with N = 100 and L = 3.The number of genes K ∈ {100, 200, 500}.To preserve the realistic pattern of correlation in DNA methylation, we randomly select J = 300 methylation probes from DNA methylation of real data.The average number of methylation probes mapped to each gene is 3.Each methylation probe is randomly allocated to a gene with the constraint that each gene contains at least one methylation probe.In the mechanistic model, 50% of genes are randomly selected to be significantly regulated by their methylation probes, and z jk ∼ Bern(1/2) for j ∈ J k , the methylation probe set in the promoter region of gene K, when the k-th gene is significantly regulated by DNA methylation.For the selected methylation probes, the coefficient w jk is sampled from Unif(0.5, 1) or Unif(−1, −0.5), and the regulated part In the clinical model, the first five genes are selected to inf luence the clinical outcome through both the part regulated by methylation and the part not regulated by methylation.The second five genes are selected to impact the outcome only through the part regulated by DNA methylation, and the third five genes impact through the part not regulated by methylation.It implies that both γ M and γ M have 10 out of K genes equal to one, and the remaining are zeros.We simulate the graph structure by sampling edges among effective genes from Bern(0.1) and sampling the same number of edges among ineffective genes.For the effective genes, the regression coefficients in the clinical model are sampled from Unif(1, 1.5) or Unif(−1.5, −1).
The coefficients of clinical covariates and intercept are set as 1, and the variance of clinical model σ 2 ∈ {1, 4, 9} is used to generate outcome Y.In each simulation, 100 independent samples are generated as the validation dataset.
In scenario I, we evaluate the proposed method (NetMIM), the proposed method without MRF prior (NetMIM w/o MRF), the FBM method [16], FBMcorr (FBM method considering correlation in DNA methylation), and two mimic frequentist methods, mimic lasso and mimic elastic net, in the simulated full omics data with varying variances and different numbers of genes.In NetMIM without MRF, we set the hyperparameter f = 0 in the MRF prior.The mimic lasso and mimic elastic net methods are implemented by modeling each gene expression by the corresponding methylation probes via lasso [19] and elastic net [20] regression, respectively.After the parts of gene expression regulated by methylation are obtained from the previous procedure, the clinical model is also modeled through lasso and elastic net regression.The hyperparameters in the lasso and elastic net regression are determined by a five-fold cross-validation scheme.In scenarios II to IV, data with missingness are generated after full omics data are simulated [16].We evaluate NetMIM, NetMIM without MRF, FBM, NetMIM on complete data (NetMIM_CC), NetMIM without MRF prior on complete data (NetMIM_CC w/o MRF), and FBM on complete data (FBM_CC) using the simulated data with missingness.The methods on complete data mean that the methods are implemented only on the subjects with both gene expression and DNA methylation (remove the subjects with only one type of omics data).
For prior specification, the hyperparameters in the MRF prior are set as d = −3 and f = 0.5, which implies that the prior probability of a gene without neighborhoods is exp(−3) 1+exp(−3) ≈ 0.05, encouraging a sparse model.To avoid the phase transition problem, we prefer a small f value and 0.5 is also common in other methods adopting MRF prior in the variable selection [40].For the scale parameters in the spike-and-slab prior, we set [41].We refer to the sensitivity analysis results reported in the Supplementary Material for more details.As for the beta prior on the methylation probe selection, we set a = 0.2 and b = 0.8, indicating 20% expected prior probability of inclusion.Vague priors are assigned for the variance in the model with δ 1 = δ 2 = 0.001.

Results
Figure 3 illustrates the AUC of gene selection results in the clinical model and the outcome prediction performance by PMSE on the validation dataset for scenario I.The proposed method, NetMIM, consistently outperforms the FBM method and the two other frequentist methods in variable selection and outcome prediction.Furthermore, the proposed method that incorporates structure information among genes achieves the best performance, better than NetMIM without MRF prior, especially for a higher number of genes.This observation suggests that incorporating extra structure information can enhance the model's training.Figure 4 displays the posterior probability of gene inclusion for the three Bayesian methods, NetMIM, NetMIM without MRF, and FBM.When incorporating the network structure, the signals of effective genes are higher.For the FBM method, the average posterior probability of inclusion of γ M is higher than that of γ M , making it challenging to select γ M and resulting in a lower AUC in feature selection.The reason for the phenomenon is that the estimated variance in the slab distribution for E M features was smaller than that for E M features in the FBM model.Higher variances will introduce a greater penalty on the regression coefficients.As a result, they are more likely to shrink toward zero.
In scenario II, data with missingness are generated with different ratios of subjects whose gene expression is missing.Figure 5 displays the AUC of gene selection and outcome prediction in scenario II.In most cases, incorporating MRF prior improves the model performance, leading to better feature selection and model prediction (The blue line is better than the corresponding orange line).Furthermore, NetMIM_CC performs better than NetMIM with or without MRF prior.It is better to remove subjects without gene expression.The augmentation of gene expression introduces more uncertainty for feature selection and model prediction.In scenario III, data with missingness are generated with different ratios of subjects whose DNA methylation is missing.Figure 6 demonstrates the AUC of gene selection and outcome prediction in scenario III.Similarly, incorporating MRF prior leads to better feature selection and model prediction (The blue line is better than the corresponding orange line).When the missing ratio is small, the methods (both with MRF and without MRF) using complete data perform better, but the methods incorporating missing data perform better when the missing ratio is high.The inf luence of missing ratio on model performance is similar to a bivariate normal distribution, which we explored in Section 3 of the Supplementary Materials.In scenario IV, half of the total subjects are missing in gene expression or DNA methylation, with different proportions of subjects with missing gene expression.In this case, NetMIM performs the best, as shown in Figure S4.The AUCPRs of feature selection results in all simulations are shown in Figure S3 of Supplementary Materials, which also illustrate similar results as AUCs.
The question arises as to when we should include subjects without part of the omics data.Although we have explored the phenomenon that including subjects with missing data may sometimes decrease the model performance in the Supplementary Material, it is difficult to find an exact criterion to make a decision.Therefore, following the procedure in [16], we design a five-fold cross-validation scheme to determine whether to include those subjects or not.Specifically, we divide the training data into five equal parts with the same missing ratio.Each time we use one part as the validation set, and use the remaining as the training data set.NetMIM is trained on the whole subjects but NetMIM_CC is trained only on the subjects with complete data.The PMSE on the validation set for these two methods is computed from the subjects with complete data.The procedure is repeated across all five parts, and the method with the smaller average PMSE is selected as the final training method.Finally, the selected method is trained on the original training data set.This cross-validation scheme is denoted as NetMIM_CV.We also conduct some simulations for NetMIM, NetMIM_CC, and NetMIM_CV.Figure 7 shows that NetMIM_CV can achieve good performance in both cases where NetMIM performs better in case  1, but NetMIM_CC performs better in case 2 (case 1 is 50% missing ratio with K = 100 in scenario III, and case 2 is 50% missing ratio with K = 100 in scenario II).

Evaluation on continuous response
The dataset is a public dataset (GSE65205) from a case-control study of atopic asthma and nasal epithelial DNA methylation in 72 predominantly African American children [42], with complete DNA methylation data from Illumina 450k chips and mRNA expression from Agilent-028004 SurePrint G3 Human GE 8x60K Microarray data.We used the M-value for methylation level for better model fitting.We are interested in the genes with methylation probes in the promoter regions and belonging to 20 pathways in the enrichment analysis of differentially expressed genes for another independent nasal epithelial data [43], resulting in 689 genes and 1844 methylation probes.To construct the interaction network G, we set g jl = 1 if there exists a direct interaction in a pathway for gene j and gene l and g jl = 0 otherwise.Serum Immunoglobulin E (IgE) level is a primary clinical outcome in children's asthma studies.We take log-transformed IgE level as our clinical response with age and gender as clinical variables.We consider five scenarios: (i) missing gene expression 20% (E:20%), (ii) missing gene expression 40% (E:40%), (iii) missing DNA methylation 20% (M:20%), (iv) missing DNA methylation 40% (M:40%), (v) no missing omics data (Full Set).We used 20 subjects as the test data in each scenario and repeated 10 times.
The hyperparameters to control the MRF prior are set as d = −4  and f = 0.5 to assign a prior probability of inclusion of genes as approximately 0.02.The other hyperparameters are the same as the simulation studies.Table 1 shows the RMSE of outcome prediction of different methods on the test dataset in four scenarios.The lasso and elastic net method is the integrative analysis's regression model concatenating mRNA expression and DNA methylation.NetMIM achieved the best performance under model prediction in the first three scenarios.Moreover, incorporating subjects with missing omics data improved model performance in NetMIM and FBM (NetMIM versus NetMIM_CC and FBM versus FBM_CC).The MRF prior also contributed to the improvement of model performance (NetMIM versus NetMIM w/o MRF and NetMIM_CC versus Net-MIM_CC w/o MRF).Finally, it is worth noting that the mimic methods demonstrated better performance compared with the vanilla methods, particularly in the case of the elastic-net method, which illustrates the contributions of the mechanistic model.

Evaluation on survival response
In this section, we present an application of the proposed method to KIRC data from TCGA data portal, with DNA methylation data from Illumina 450K chips and RNA-seq gene expression data.The dataset includes N = 532 kidney cancer patients, all of whom Table 1.Means and standard deviations (in parentheses) of PMSE on a test dataset of different scenarios for the GSE65205 dataset; mimic lasso, mimic elastic net, lasso, and elastic net are implemented on the complete dataset; the results are based on 10 replications.Figure 7. Boxplot of AUC and PMSE of different methods and CV scheme in two cases; the red point represents the mean value of the measure; Case 1 is 50% missing ratio with K = 100 in scenario III, and case 2 is 50% missing ratio with K = 100 in scenario II.

Method Scenario (i) (E:20%) (ii) (E:40%) (iii) (M:20%) (iv) (M:40%) (v) (Full Set
have gene expression data, but there are only N M obs = 318 subjects whose DNA methylation is measured, implying that almost 42% of the subjects do not have DNA methylation data.Regarding clinical variables, the survival time of patients is the response with age and gender as clinical covariates.For genetic features, we are interested in the genes belonging to 29 KEGG pathways in the enrichment analysis of differential genes for renal clear cell carcinoma, resulting in 1772 genes [44,45].The RNA-seq counts are transformed into continuous TPM (transcripts per million) values, and DNA methylation levels are represented by the Mvalue.We filter out genes with a mean expression level less than 10 or a standard deviation less than 5, resulting in K = 814 genes for the analysis [16].In addition, we select the methylation probes mapped to each gene in the promoter region, obtaining J = 6099 methylation probes.These types of regulations, with regulators close to the target, are called cis-regulation, as opposed to transregulation, where the regulators are far from the target gene.Due to the smaller effective size of trans-acting variants, it is more efficient to detect cis-acting variants with a relatively small sample size.In our model formulation, the last component, MRF prior on selection indicators γ , requires the interaction network G among K = 814 genes, which are extracted from the KEGG database with the R package KEGGgraph [46].If there exists a direct interaction in a pathway for gene j and gene l, then g jl = 1; g jl = 0 otherwise.To compare the performance of the proposed method including subjects with missing methylation data and complete data, we randomly split the complete data into training data (220) and test data (93) after removing subjects with survival time less than 30 days.The censoring proportions in each subset are the same (66%).With respect to prior specification, the hyperparameters to control the MRF prior are set as d = −4 and f = 0.5 to assign a prior probability of inclusion of genes as approximately 0.02.The other hyperparameters are the same as the simulation studies.

Results
To assess convergence, we ran four MCMC chains independently from different initial values with trace plots shown in Supplementary Material.We computed the pairwise Pearson correlation coefficients of the marginal posterior probability of inclusion for γ between different chains to check the consistency of selection results.The correlation coefficients of the posterior probability of inclusion ranged from 0.883 to 0.893 for γ and from 0.892 to 0.915 for Z among the four chains, demonstrating good convergence and consistent variable selection results of our model.Furthermore, we compared the performance of NetMIM, NetMIM_CC, and NetMIM_CV.To demonstrate the effect of MRF prior, we also implemented corresponding methods without MRF prior.As shown in Table 2, NetMIM performed best on the test dataset.The methods incorporating MRF prior achieved better model prediction performance.The FBM method is specifically tailored for continuous response variables and does not have a generalization for survival response.Therefore, we did not assess its performance on this particular real dataset for survival analysis.Instead, we compared NetMIM with recent survival analysis methods, including DeepOmix [47], BlockForest [48], and IPF_LASSO [49], whose configurations are detailed in the Supplementary Material.Regarding the efficiency, the computational time was 8.6 h, 2.4 h, 5.2 min, and 2.3 min for NetMIM, DeepOmix, BlockForest, and IPF_LASSO, respectively.NetMIN is time-consuming due to the computationally intensive MCMC algorithms and the complex mechanistic model, even though it was written by Rcpp to accelerate computation.All experiments were implemented on a high-performance computing server with E5 -2643 v4 CPU (20 M cache, 3.40 GHz) with 256GB memory.
Applying the NetMIM model to the training data with missingness, we identified 128 genes with only type M effect, 33 genes with only type M effect, and 14 genes with both type M effect and type M effect when utilizing the median probability model for selection (PPI cutoff = 0.5).This result implies that the 14 genes have effects modulated by both methylation and other mechanisms.Additionally, we identified 78 genes that are significantly

Biological findings
To validate our findings in biology, we investigated the gene symbols associated with survival time, particularly focusing on the 14 genes with both type M effect and type M effect.Among the 28 effects (14 type M effects and 14 type M effects), 75% were negatively associated with the survival time response.Specifically, KRAS had the maximum absolute regression coefficient β M = −0.23,which is a well-established tumor-driver gene involved in cancer initiation, development, and progression.MYC and THBS3 were also negatively related to survival time.MYC is an oncogene that is frequently amplified in cancer cells.For THBS3, recent studies have shown that the THBS family plays a crucial role in the development and progression of human cancer [50].More details for other genes are shown in the Supplementary Material.We next investigated whether the total 175 identified genes represented better functional annotation in a biological sense by employing databases for annotation visualization and integrated discovery (DAVID) [51].Annotation terms at a 0.05 threshold applied to adjusted P-values were selected [52].Under the KEGG pathway subcategory shown in Figure S11, the pathway Focal adhesion had a relatively large gene ratio and the smallest adjusted P-value (1.5 × 10 −140 ).The terms PI3K-Akt signaling pathway and pathways in cancer followed, with the second and third smallest adjusted P-values (7.8 × 10 −77 and 2.8 × 10 −73 , respectively).Proteins in the Focal adhesion are associated with cancer metastasis, which is responsible for as many as 90% of cancer-associated deaths in patients [53].The PI3K-Akt signaling pathway is an intracellular signaling pathway that is important in regulating the cell cycle and is a central regulator pathway of several cancers, such as ovarian cancer and breast cancer.We comment on interesting findings from these pathways and other functional terms in the Supplementary Material.
In addition to our primary analysis, we have also applied Net-MIM to the LUAD data obtained from TCGA data portal.Detailed information on this analysis can be found in the Supplementary Materials.In the LUAD analysis, we observed that the PI3K-Akt signaling pathway (P-value = 1.3 × 10 −45 ), Pathways in cancer (P-value = 5.9 × 10 −44 ), and Focal adhesion (P-value = 1.0 × 10 −43 ) were identified as the top three enriched pathways for the genes identified in the analysis.This finding further emphasizes the significance of these pathways in the development and progression of human cancers.The consistency of these enriched pathways across different datasets reinforces their importance and suggests their potential as therapeutic targets in cancer research.

Conclusion and discussion
The heterogeneity and high variability of omics data pose challenges to integrative analysis.Incorporating structural dependencies among genes or other genetic features can improve the performance of integrative analysis.However, only a few existing Bayesian hierarchical models that investigate the association between genes and clinical outcomes consider the structural information among genetic features, especially the pathway information among genes.Another characteristic of multi-omics data integration is the potential for a large proportion of missing data since not all patients contain all omics data.
To address these challenges, we propose a network-based multi-omics integrative framework with missingness to perform biomarker selection, outcome prediction, and handling of missing data.Simulation studies have demonstrated that incorporating pathway information among genes improves the performance of feature selection and model prediction.Extensive simulations with missing data indicated that including subjects with incomplete omics data is not always favored over the complete case.To decide on the method of handling missing data, a cross-validation scheme was proposed, which achieved similar performance to the better one.Our proposed method provides a comprehensive approach to handle multi-omics data with missingness and incorporate pathway information among genes.It can improve the accuracy and interpretability of prognostic biomarker selection and analysis.
Despite its remarkable performance, our framework has some limitations.The MCMC algorithm has a high computation burden, even with optimized R code using C++.Additionally, our framework only includes two types of omics data (DNA methylation and mRNA expression), limiting its practical application.In the future, we will explore developing a framework that integrates more types of omics data, including single-nucleotide polymorphisms and micro RNA [36].Furthermore, several extensions of our model are worth exploring.Firstly, considering the heterogeneity of cancers, we could extend the model to a finite mixture model, clustering patients to find cancer subtypes and estimating the number of clusters directly from the data.Secondly, we could not only propose a structural prior on the effects between DNA methylation and gene expression [54] but also consider the interaction effects of gene expression and DNA methylation on clinic outcomes.Finally, for the gene expression part regulated by other mechanisms, we could construct a model given other regulators' measurements are known, such as pre-defined transcription factors.

Key Points
• The Dirac spike-and-slab feature selection priors are utilized to identify efficient biomarkers, achieving zero coefficient exactly.
• The integrative framework incorporates the known structural dependencies among biomarkers into feature selection via a Markov random field prior.• Via data augmentation approaches, the integrative framework is generalized to different clinical outcomes, including continuous, binary, and right-censored responses.

Figure 1 .
Figure 1.Major missing pattern in multi-omics data; each row consists of blocks representing different omics data types from the same individual; the colored blocks highlight the areas where some omics data are missing for certain individuals.
the effect of DNA methylation on k-th gene expression; E M = (e M nk ) N×K = (E M 1 , . . ., E M K ):E M k denotes the effect of other regulators on k-th gene expression;

Figure 2 .
Figure 2. A graphical representation of the proposed model; each node in a circle is a parameter of the model and each node in a square represents an observable information; the solid line refers to a probabilistic direct dependence and the dashed line refers to a deterministic relationship; the hyperparameters are omitted.

M 1 ,
denotes the effects of clinical factors on the outcome; β M = (β M 1 , . . ., β M K ) denotes the effects of the part of the gene expressions regulated by DNA methylation, called type M effect; β M = (β . . ., β M K ) denotes the effects of the part of the gene expressions regulated by other mechanisms, called type M effect.
d and f are hyperparameters, γ −k denotes the vector of γ with k-th element excluded, and N k is the set of neighbors of kth element in the network.d is negative, encouraging sparsity in the model, and f is positive, indicating neighboring elements are jointly effective in the model.For the node without any neighbor, its prior distribution reduces to a Bernoulli prior with parameter π = exp(d)/(1 + exp(d)).The joint prior on γ is be the missing indicator of gene expression data and U M = (U M 1 , . . ., U M N ) be the missing indicator of DNA methylation; we assume U E i × U M i = 1 for ∀1 ≤ i ≤ N, indicating the patient should have at least one type of omics data.The full omics data contain the observed part and missing part, denoted, respectively, by

Figure 3 .
Figure 3. AUC and PMSE of different methods under different numbers of genes and model standard deviations, respectively; the number of genes K varies from 100, 200, 500; the x-axis represents the standard deviation σ of the clinical model.

Figure 4 .
Figure 4. Posterior inclusion probabilities of features for the three Bayesian methods when K = 500 and sd = 2; the x-axis is the feature index, among which the first 500 indices are E M variables and the second 500 indices are E M variables; the blue points represent the effective features and the red points represent the ineffective features; the PPI values for features are the average based on 50 replicated simulated data.

Figure 5 .
Figure 5. AUC and PMSE of different methods under different missing ratios for gene expression; the horizontal axis is the proportion of subjects without mRNA expression in the training data; the number of genes K varies from 100, 200, 500.

Figure 6 .
Figure 6.AUC and PMSE of different methods under different missing ratios for DNA methylation; the horizontal axis is the proportion of subjects without DNA methylation in the training data; the number of genes K varies from 100, 200, 500. )

Table 2 .
Means and standard deviations (in parentheses) of C-indices of training data and test data of different methods on KIRC data; DeepOmix, BlockForest, and IPF_LASSO are implemented on the complete data set; the analysis is repeated 10 times.